VDJ_build()
function
Af_build() function
Af_plot_tree()
function
Af_compare_methods() function
Af_compare_within_repertoires() function
Af_compare_across_repertoires() function
This vignette provides a detailed explanation of the main functions
from the AntibodyForests package, including the VDJ_build()
function from the Platypus package.
These functions are designed for importing single-cell V(D)J sequencing
data from 10x Genomics, constructing lineage trees, comparing the
structure of these trees within and across repertoires, and
incorporating features of protein structure and evolutionary likelihood
from Protein Language Models. Throughout this vignette, the dataset
outlined by Neumeier
et al. (2022) and Kim et
al. (2022) will be used.
This quick start gives a short use case of AntibodyForests. 10x output of five mouse immunized with Ovalbumin (OVA) from Neumeier et al. (2022) are used to create a VDJ dataframe with Platypus. AntibodyForests is used to create lineage trees for each B cell clonotype, these trees are then clustered and the difference in topology between the clusters is analyzed.
# Import 10x Genomics output files into VDJ dataframe and only keep cells with one VDJ and one VJ transcript
# For each clone, trim germline sequence and replace CDR3 region in germline with most-frequently observed CDR3 sequence
VDJ_OVA <- Platypus::VDJ_build(VDJ.directory = "10x_output/VDJ/",
remove.divergent.cells = TRUE,
complete.cells.only = TRUE,
trim.germlines = TRUE)
# Build lineage trees for all clones present in the VDJ dataframe with the default algorithm
AntibodyForests_OVA_default <- Af_build(VDJ = VDJ_OVA,
construction.method = "phylo.network.default")# Plot one of the lineage trees, constructed with the `phylo.network.default' method
Af_plot_tree(AntibodyForests_object = AntibodyForests_OVA_default,
sample = "S1",
clonotype = "clonotype2")# Cluster the trees in the repertoire based on the euclidean distance between various metrics of each tree
out <- Af_compare_within_repertoires(input = AntibodyForests_OVA_default,
min.nodes = 5,
distance.method = "euclidean",
distance.metrics = c("mean.depth", "sackin.index", "spectral.density"),
clustering.method = "mediods",
visualization.methods = "PCA",
plot.label = T)
# Analyze the difference between the clusters based on the calculated sackin.index
plots <- Af_cluster_metrics(input = AntibodyForests_OVA_default,
clusters = out$clustering,
metrics = c("spectral.density"),
min.nodes = 5,
significance = T)VDJ_build() functionThe VDJ_build() function imports Cell
Ranger output into an R dataframe, priming the data for downstream
analyses. It is optimized for Cell Ranger v7 and older versions and is
suitable for B and T cell repertoires. Upon execution, the function
generates a VDJ dataframe where VDJ and VJ transcripts are paired (if
present) in each row, with each row representing a single cell. Post VDJ
dataframe construction, a table is displayed on the console, delineating
the number of filtered cells, contingent upon the
remove.divergent.cells and complete.cells.only
parameters.
For seamless integration with functions such as the
Af_build() function (discussed next), it is strongly
recommended to set the remove.divergent.cells and
complete.cells.only parameters to TRUE. Cells with more
than one VDJ or VJ transcript, referred to as ‘divergent cells’,
typically represent doublets that may arise during the single-cell
sorting, or show dual expression of kappa and lambda light chains. Such
occurrences are rare, accounting for only 0.2-0.5% of peripheral blood B
cells, as reported by Giachino et
al. (1995). Cells harboring both a single VDJ and a VJ transcript
may be included for comprehensive downstream analyses.
The output of 10x Genomics contains one output folder per
run/sample. In each sample folder, the following files should be present
in order to run the VDJ_build() function successfully. More
information on the outputs of 10x Genomics can be found on
this page.
filtered_contig_annotations.csvfiltered_contig.fasta file. These annotations
are required to obtain the trimmed V(D)J sequences.filtered_contig.fastafiltered_contig_annotations.csv.consensus_annotations.csvconsensus.fasta file.consensus.fastaconcat_ref.fastatrim.and.align parameter).all_contig_annotations.json file will be employed to obtain
the ‘L-REGION+V-REGION’ and ‘J-REGION’ annotations. The raw sequences
can now be trimmed using the ‘L-REGION+V-REGION’ start and ‘J-REGION’
end indices. If the all_contig_annotations.json file is not
found in the sample/run directory, the columns with the trimmed
sequences will remain empty and a warning message is printed.VDJ.directory
VDJ.sample.list parameter
should not be specified)."/path/to/VDJ/directory"VDJ.sample.list
VDJ.directory parameter should not be
specified).c("/path/to/sample1", "/path/to/sample2")remove.divergent.cells
TRUE, cells with more
than one VDJ transcript or more than one VJ transcript will be excluded
from the VDJ dataframe.FALSEcomplete.cells.only
TRUE, only cells with
both a VDJ transcript and a VJ transcript are included in the VDJ
dataframe.FALSEtrim.germlines
TRUE, the raw germline
sequence of each clone will be aligned with the consensus sequence of
that clone serving as the reference sequence, utilizing the
Biostrings::pairwiseAlignment() function. The alignment
conducted will be of the ‘global-local’ type, therebyy trimming the raw
germline sequence.FALSEfill.germline.CDR3
Biostrings::pairwiseAlignment() function. The
alignment conducted will be of the ‘global-local’ type, thereby
selecting the CDR3 region from the trimmed germline sequence. After the
alignment, the germline CDR3 sequence is replaced by the most frequently
observed CDR3 sequence, in order to obtain germline sequences that are
more likely to encode producible and productive antibodies.FALSEgap.opening.cost
Inf will result in a gapless alignment.10gap.extension.cost
Inf
will result in a gapless alignment.4In this example, divergent cells are filtered out, and only cells with both a VDJ and VJ transcript are included in the VDJ dataframe. The resulting VDJ dataframe contains 17,891 cells with both a single VDJ and VJ transcript. The number of cells filtered out due to the lack of VDJ and VJ transcripts, referred to as ‘incomplete cells’, is shown in the right column of the printed table. Since a cell can have a double VDJ or VJ transcript or may lack a VDJ or VJ transcript, such a cell can be included in both the counts of ‘divergent cells’ and ‘incomplete cells’. Note that, as a result, the sum of these counts may not exactly match the difference in the number of rows between the dataframes. Additionally, for each clone, the germline sequences are trimmed, and an additional germline sequence column is added in which the CDR3 region is replaced by the most frequently observed CDR3 sequence in that clone.
# Read in data, keep only complete, non-divergent cells and trim germline sequences
VDJ_OVA <- Platypus::VDJ_build(VDJ.directory = "10x_output/VDJ/",
remove.divergent.cells = TRUE,
complete.cells.only = TRUE,
trim.germlines = TRUE,
fill.germline.CDR3 = TRUE)| divergent.cells | incomplete.cells | |
|---|---|---|
| S1 | 360 | 708 |
| S2 | 295 | 503 |
| S3 | 79 | 387 |
| S4 | 183 | 380 |
| S5 | 526 | 1113 |
For illustration, the dummy VDJ dataframe below contains the data of clonotype 1-5 of S1: